{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Tce3stUlHN0L"
      },
      "source": [
        "##### Copyright 2020 The TensorFlow Authors."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "tuOe1ymfHZPu"
      },
      "outputs": [],
      "source": [
        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MfBg1C5NB3X0"
      },
      "source": [
        "<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
        "  <td><a target=\"_blank\" href=\"https://tensorflow.google.cn/io/tutorials/genome\"><img src=\"https://tensorflow.google.cn/images/tf_logo_32px.png\">在 TensorFlow.org 上查看 </a></td>\n",
        "  <td><a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/docs-l10n/blob/master/site/zh-cn/io/tutorials/genome.ipynb\"><img src=\"https://tensorflow.google.cn/images/colab_logo_32px.png\">在 Google Colab 中运行 </a></td>\n",
        "  <td><a target=\"_blank\" href=\"https://github.com/tensorflow/docs-l10n/blob/master/site/zh-cn/io/tutorials/genome.ipynb\"><img src=\"https://tensorflow.google.cn/images/GitHub-Mark-32px.png\">在 GitHub 中查看源代码</a></td>\n",
        "      <td><a href=\"https://storage.googleapis.com/tensorflow_docs/io/docs/tutorials/genome.ipynb\">{img1下载笔记本</a></td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xHxb-dlhMIzW"
      },
      "source": [
        "## 概述\n",
        "\n",
        "本教程将演示 `tfio.genome` 软件包，其中提供了常用的基因组学 IO 功能，即读取多种基因组学文件格式，以及提供一些用于准备数据（例如，独热编码或将 Phred 质量解析为概率）的常用运算。\n",
        "\n",
        "此软件包使用 [Google Nucleus](https://github.com/google/nucleus) 库来提供一些核心功能。 "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MUXex9ctTuDB"
      },
      "source": [
        "## 设置"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "IqR2PQG4ZaZ0"
      },
      "outputs": [],
      "source": [
        "try:\n",
        "  %tensorflow_version 2.x\n",
        "except Exception:\n",
        "  pass\n",
        "!pip install tensorflow-io"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "bkF2WtCMaJ-3"
      },
      "outputs": [],
      "source": [
        "import tensorflow_io as tfio\n",
        "import tensorflow as tf"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6wkjlql3cOy0"
      },
      "source": [
        "## FASTQ 数据\n",
        "\n",
        "FASTQ 是一种常见的基因组学文件格式，除了基本的质量信息外，还存储序列信息。\n",
        "\n",
        "首先，让我们下载一个样本 `fastq` 文件。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "yASvppCxceBu"
      },
      "outputs": [],
      "source": [
        "# Download some sample data:\n",
        "!curl -OL https://raw.githubusercontent.com/tensorflow/io/master/tests/test_genome/test.fastq"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3zekWXlVdprb"
      },
      "source": [
        "### 读取 FASTQ 数据\n",
        "\n",
        "现在，让我们使用 `tfio.genome.read_fastq` 读取此文件（请注意，`tf.data` API 即将发布）。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "vl761cHTc7N1"
      },
      "outputs": [],
      "source": [
        "fastq_data = tfio.genome.read_fastq(filename=\"test.fastq\")\n",
        "print(fastq_data.sequences)\n",
        "print(fastq_data.raw_quality)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qxHjVKXzdx5W"
      },
      "source": [
        "如您所见，返回的 `fastq_data` 具有 `fastq_data.sequences`，后者是 fastq 文件中所有序列的字符串张量（大小可以不同）；并具有 `fastq_data.raw_quality`，其中包含与在序列中读取的每个碱基的质量有关的 Phred 编码质量信息。\n",
        "\n",
        "### 质量\n",
        "\n",
        "如有兴趣，您可以使用辅助运算将此质量信息转换为概率。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "6IYxfFI4eQTM"
      },
      "outputs": [],
      "source": [
        "quality = tfio.genome.phred_sequences_to_probability(fastq_data.raw_quality)\n",
        "print(quality.shape)\n",
        "print(quality.row_lengths().numpy())\n",
        "print(quality)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bg3wzTFzhcfS"
      },
      "source": [
        "### 独热编码\n",
        "\n",
        "您可能还需要使用独热编码器对基因组序列数据（由 `A` `T` `C` `G` 碱基组成）进行编码。有一项内置运算可以帮助编码。\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "oAiepmy8h32a"
      },
      "outputs": [],
      "source": [
        "print(tfio.genome.sequences_to_onehot.__doc__)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "oAiepmy8h32a"
      },
      "outputs": [],
      "source": [
        "print(tfio.genome.sequences_to_onehot.__doc__)"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "collapsed_sections": [
        "Tce3stUlHN0L"
      ],
      "name": "genome.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
